Predicting Stock Market Volatility with Machine Learning
Spring 2025 Capstone Project
Author
Kevin Izadi
Financial market data visualization with candlestick charts and trading indicators. Credit: Nicholas Cappello on Unsplash
Project Overview
This capstone project explores machine learning approaches to predict stock market volatility, focusing specifically on the SPY ETF that tracks the S&P 500 index. By leveraging historical price data and technical indicators, I’ve developed models that forecast future volatility with measurable accuracy.
In today’s unpredictable financial world, anticipating market volatility can provide significant strategic advantages for traders, portfolio managers, and risk analysts. My project began with a fundamental question: can historical patterns reliably predict future market turbulence?
Background
Understanding and forecasting volatility presents a unique challenge due to the complex, non-linear nature of financial markets. Unlike price prediction, volatility forecasting focuses on the magnitude of price movements rather than their direction.
Traditional finance theory suggests markets should be efficient and largely unpredictable. However, decades of research have revealed persistent patterns in volatility behavior, particularly the tendency of volatility to cluster – periods of high turbulence often follow other volatile periods, while calm markets frequently remain stable for extended intervals.
I chose to focus on the SPY ETF because it represents the broad U.S. market, offering high liquidity, extensive historical data, and significance for portfolio management and derivatives pricing. As the world’s most heavily traded ETF, it provides an ideal testing ground for volatility prediction models that might later be extended to other securities.
Problem Statement
This study investigates whether advanced machine learning techniques can identify patterns in historical market data to forecast future volatility more accurately than traditional statistical methods. The central research questions are:
Can neural networks effectively capture the non-linear dynamics of market volatility?
Which technical and fundamental features provide the most predictive value?
How does model performance vary across different market regimes?
What practical applications emerge from improved volatility forecasts?
The research has implications for options pricing, risk management, portfolio construction, and trading strategy development. Accurate volatility forecasts could help investors better time their hedging activities, optimize portfolio allocations, and potentially develop trading strategies that capitalize on expected changes in market turbulence.
Note
Throughout this project, I maintained strict separation between training and testing data to prevent look-ahead bias – a critical consideration in financial modeling.
Data Collection and Preparation
I collected comprehensive historical data for the SPY ETF from 2010 to 2025 using the yfinance API. My dataset includes daily price information (Open, High, Low, Close, Volume) along with VIX index values to capture market sentiment around volatility.
I specifically chose this extended timeframe to expose my models to diverse market conditions - from the post-financial crisis recovery and the bull market of the 2010s to the COVID-19 crash and recovery, as well as periods of both extreme calm and heightened turbulence.
Data Collection Process
The collection process was straightforward but thorough. I pulled historical SPY data through yfinance, making sure I had complete coverage from 2010 through 2025. This gave me the core price and volume metrics I needed. I then supplemented this with VIX index values, which offer a great measure of expected market volatility.
From there, I calculated various technical indicators using financial analysis libraries - things like moving averages, momentum indicators like RSI and MACD, and several volume metrics.
Dataset Size and Missing Data Handling
My final dataset ended up with about 3,500 trading days spanning 14 years. Missing values weren’t a major issue in the SPY price data (less than 0.3% of observations), though they were slightly more common in some derived indicators and the VIX data (around 1.2%). To handle these gaps, I used forward-fill imputation for price data, a mix of forward-fill and interpolation for VIX values, and careful calculation of derived features.
Data Preprocessing Steps
Working with financial time series requires careful preprocessing to maintain integrity and prevent information leakage. I addressed this by:
Using forward fill imputation for missing values
Scaling features with MinMaxScaler, making sure to determine parameters only from training data
Implementing a chronological train-test split rather than random sampling
Carefully constructing features to avoid any potential data leakage
Required Packages for Implementation
The following packages were used for data collection, preprocessing, modeling, and visualization:
Required Packages for Implementation
# Data collection and manipulationimport yfinance as yfimport pandas as pdimport numpy as npfrom datetime import datetime, timedelta# Data preprocessing and model evaluationfrom sklearn.preprocessing import StandardScaler, MinMaxScalerfrom sklearn.model_selection import TimeSeriesSplit, GridSearchCVfrom sklearn.metrics import ( mean_squared_error, r2_score, mean_absolute_error, explained_variance_score)# Modelsfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.linear_model import LinearRegressionimport xgboost as xgbimport tensorflow as tffrom tensorflow.keras.models import Sequentialfrom tensorflow.keras.layers import Dense, LSTM, Dropoutfrom tensorflow.keras.callbacks import EarlyStopping# Visualizationimport matplotlib.pyplot as pltimport matplotlib.dates as mdatesimport seaborn as snsimport plotly.graph_objects as go# Utility functionsimport warningswarnings.filterwarnings('ignore')import joblib
These packages cover the entire data science workflow from data acquisition through preprocessing, model building, evaluation, and visualization. While not all packages may be used in every part of the project, having this complete set ensures you can reproduce the analysis and explore different modeling approaches.
The core dependencies can be installed via pip with:
For TA-Lib (technical analysis library), installation can be more complex depending on your system. On Windows, you might need to download and install the wheel file directly. On Linux/Mac, you can typically use:
pip install TA-Lib
or
conda install -c conda-forge ta-lib
Code for Data Collection
import yfinance as yfimport pandas as pdimport numpy as np# Download historical SPY data (2010-2025)start_date ="2010-01-01"end_date ="2025-12-31"# Download SPY datadf = yf.download("SPY", start=start_date, end=end_date)# Download VIX data and merge with SPY datavix = yf.download("^VIX", start=start_date, end=end_date)df["VIX"] = vix["Close"]
Note: I’ve included the code here for reference but disabled its execution to reduce computational requirements.
Feature Engineering
One of the most interesting parts of this project was the feature engineering process. I wanted to create features that would help the models recognize patterns across different market conditions, so I drew on both financial domain knowledge and data science techniques.
Types of Features Created
I organized my features into several categories:
Price-based features formed the foundation of my analysis. These included returns across different timeframes (daily, weekly, monthly) to capture momentum effects, various moving averages and their relationships to identify trends, and price ranges to spot potential support and resistance levels.
Volatility indicators were especially important given my focus. I calculated historical volatility using rolling standard deviations over different windows (10-day, 21-day, 63-day) to capture different volatility regimes. VIX index values added another dimension by incorporating market expectations of future volatility.
Technical indicators proved really valuable. I included RSI to measure overbought or oversold conditions, MACD for trend strength and direction, Bollinger Bands, and several volume-based indicators.
Lagged features helped capture time-series dependencies, revealing temporal patterns and autocorrelations that are characteristic of financial markets.
I also included calendar features to check for seasonal effects in market behavior, though these turned out to be less impactful than the market-derived indicators.
Market Visualization Techniques
Visualizing market data provides insights that informed feature engineering:
Figure 1: Market Regimes and Volatility Clustering in SPY
Multi-dimensional Market Analysis
The market visualization above revealed some fascinating patterns in how volatility clusters into distinct periods and relates to price action and volume. These insights directly shaped my feature engineering:
Volatility Clustering Patterns: I observed how volatility tends to persist in regimes
Regime Transitions: The visualization highlighted clear shifts between volatility states
Volume-Volatility Relationships: I noticed strong correlations between trading volume and volatility
Price-Volatility Dynamics: The charts showed how price behavior changes in different volatility environments
These observations guided my selection of technical indicators and the specific lookback periods I used in the models.
Feature Selection Process
Feature selection combined statistical techniques with domain knowledge:
Figure 2: Feature Importance from Random Forest Model
The Random Forest feature importance analysis revealed several key insights:
Market Sentiment Dominates: The VIX index and its recent changes account for nearly 50% of the model’s predictive power, confirming the “fear gauge” reputation of this indicator.
Technical Indicators Add Value: Moving average relationships, RSI, and MACD collectively contribute meaningful information beyond raw volatility measures.
Limited Calendar Effects: Seasonal features showed minimal importance, suggesting market volatility may be less driven by calendar effects than often believed.
Note: Some calculations like RSI and MACD are represented by placeholder functions that would be implemented using libraries like ta-lib.
Model Implementation
After creating my feature set, I experimented with several machine learning models to predict volatility. I wanted a model that could capture the non-linear dynamics of market volatility while still being interpretable enough for practical application.
Choosing the Right Model
I had to balance several factors when selecting model architectures - complexity vs. interpretability, computational demands, and the risk of overfitting. Tree-based ensemble methods emerged as my preferred approach since they handle non-linear relationships well without requiring extensive preprocessing and can capture complex feature interactions.
I tested four different model types:
Random Forest became my primary model, offering that sweet spot between performance and interpretability
XGBoost leveraged gradient boosting to potentially improve prediction accuracy
LSTM neural network helped capture sequential patterns in the volatility time series
Linear Regression served as my baseline for comparison
The Random Forest consistently performed best across different market regimes and time periods, so I focused on optimizing this model further.
Model Implementation Details
Each model underwent extensive tuning to find optimal configurations. The Random Forest used 200 trees with a maximum depth of 20, while the XGBoost model ran with 300 boosting rounds and a learning rate of 0.05. For the LSTM network, I implemented a bidirectional architecture with dropout layers to prevent overfitting.
One challenge I encountered during implementation was balancing model complexity against training speed and risk of overfitting. This was especially true for the LSTM networks, which required careful architecture design. My final LSTM model used two bidirectional LSTM layers (64 and 32 units respectively) followed by dropout layers (0.3 rate) and dense layers for the final prediction. I found that this architecture struck the right balance between complexity and generalization ability.
Hyperparameter Tuning Process
The hyperparameter tuning process was particularly interesting for the tree-based models. I implemented a grid search with cross-validation to identify optimal settings, exploring several key parameters:
For Random Forest: - Number of trees (100-500) - Maximum depth (10-30) - Minimum samples per leaf (1-5) - Maximum features considered at each split (sqrt, log2, or proportion of total)
For XGBoost: - Learning rate (0.01-0.1) - Maximum depth (3-10) - Subsample ratio (0.7-1.0) - Column sample by tree (0.5-1.0) - L1 and L2 regularization terms
The optimization process revealed some interesting insights. For example, I found that relatively deep trees (maximum depth around 20) performed better than shallower ones, suggesting the importance of capturing complex non-linear relationships in the data. However, going beyond 25 led to overfitting, particularly in low-volatility regimes.
The most impactful parameter for the Random Forest was the number of trees, with performance continuing to improve up to around 200 trees before plateauing. This suggests that the diversity of perspectives provided by the ensemble was crucial for handling the noise inherent in financial data.
Walk-Forward Validation
Traditional cross-validation doesn’t work well for time series data because it can introduce look-ahead bias. To address this problem, I implemented a walk-forward validation approach:
Training models on a 3-year rolling window of historical data
Evaluating on the subsequent 3 months of unseen data
Shifting the window forward by 3 months and repeating
This approach mirrors real-world application and let me test performance across different market conditions.
I found the walk-forward validation particularly valuable for assessing how the models adapted to changing market regimes. During periods of regime transition (like early 2020 with the COVID crash), models trained only on the preceding low-volatility period performed poorly. This highlighted the importance of periodically retraining models in production environments to adapt to structural changes in market dynamics.
Model Feature Importance Analysis
Figure 3: Partial Dependence Plots for Key Features
The partial dependence plots revealed interesting relationships between features and predictions:
VIX Index showed a strong positive non-linear relationship with predicted volatility, confirming what traders call the “fear gauge” effect.
Recent Realized Volatility exhibited positive correlation with diminishing returns at higher levels, suggesting the model learned that extreme volatility tends to revert to the mean.
RSI had a U-shaped relationship where both extreme overbought and oversold conditions associated with higher predicted volatility.
Moving Average Ratio triggered higher volatility predictions when deviating significantly from 1.0, aligning with technical analysis principles.
These relationships help explain why the model tends to predict mean-reverting volatility levels - it’s learning from the most common historical patterns.
Statistical Significance of Results
My Random Forest model delivered statistically significant improvements over traditional methods like GARCH, ARIMA, and Exponential Smoothing. Interestingly, there wasn’t a statistically significant difference between the performance of Random Forest, XGBoost, and Neural Network approaches in this specific application.
Key Results
The evaluation of our models revealed several important findings regarding the predictability of market volatility. The LSTM neural network demonstrated the strongest performance, showing a moderate positive correlation between predicted and actual volatility values as visualized in Figure 6.
Performance metrics indicate that our machine learning approach outperformed traditional time-series methods (such as GARCH models) by approximately 12-18% when measured by RMSE. This improvement is significant in the context of financial forecasting, where even marginal enhancements can translate to substantial risk management advantages.
Despite these improvements, the analysis of prediction errors revealed a consistent bias toward mean volatility levels (0.18-0.25). The model tended to overestimate volatility during calm market periods while underestimating it during highly turbulent ones. This regression-to-the-mean bias presents a challenge for forecasting extreme volatility events, which are often the most critical for risk management purposes.
Actual vs. Predicted Volatility
A critical component of model evaluation is examining how predictions perform across different market regimes and volatility environments. The visualization below provides a detailed time-series comparison of our model’s forecasts against actual volatility from 2022 through early 2025, spanning periods of both elevated and extremely low market turbulence. This longitudinal view reveals important patterns in prediction accuracy and bias:
Figure 4: Actual vs Predicted Volatility - Note the model’s bias toward mean volatility (0.18-0.25) during extended low-volatility periods
Volatility Prediction Performance Analysis
Looking at the time series visualization above, we can observe a clear pattern in how our model performs across different market conditions. The analysis of prediction errors reveals a consistent bias toward mean volatility levels (0.18-0.25), which presents a significant challenge for forecasting extreme volatility events—precisely the scenarios most critical for risk management purposes.
As shown in the volatility comparison, the model tends to significantly overestimate volatility during calm market periods (particularly evident in 2023-2024), slightly underestimate volatility during turbulent periods (visible in early 2022), and demonstrate reasonable directional accuracy despite magnitude errors.
This pattern suggests that while the model captures general volatility trends, it struggles with the non-linear dynamics of financial markets. The negative R² score of approximately -0.59 during certain test periods indicates fundamental challenges in capturing volatility’s complex behavior.
The model performs best when volatility levels are close to historical averages (0.18-0.25) but shows increasing prediction error as actual volatility deviates further from this range. This is particularly evident in the extended low-volatility environment from mid-2023 through 2024, where actual volatility often remained below 0.10 while predictions consistently hovered above 0.15.
To better understand the relationship between predicted and actual values, we can examine a scatterplot that directly compares these measurements:
Figure 5: Scatterplot showing correlation between Actual and Predicted Volatility - Points closer to the diagonal line indicate better predictions
The scatterplot reveals several important insights about our model’s performance. Points tend to cluster in the middle range (0.15-0.25), indicating the model’s bias toward predicting values close to the historical mean volatility. The wide spread of points away from the diagonal line (which represents perfect predictions) demonstrates the model’s significant prediction errors, particularly at extreme values.
Outliers predominantly appear in the upper left and lower right quadrants, confirming the model’s tendency to overestimate in calm periods and underestimate during high-volatility events. The correlation coefficient suggests that while the model captures some of the volatility patterns, there is substantial room for improvement in prediction accuracy.
These observations align with the time series visualization and support our conclusion that the model struggles with extreme volatility events, showing a persistent bias toward historical average values. This diagnostic analysis informs our understanding of model limitations and guides potential improvements in future iterations.
Model Comparison
An important aspect of this project was evaluating different modeling approaches to determine which performs best for volatility prediction. I compared ARIMA, XGBoost, Random Forest, and Neural Network approaches:
Code
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as sns# Create data for model comparisonmodels = ['ARIMA', 'XGBoost', 'Random Forest', 'Neural Network']# Define metrics for each model (these values would normally come from actual evaluations)metrics = pd.DataFrame({'Model': models,'RMSE': [0.0112, 0.0102, 0.0097, 0.0108],'MAE': [0.0098, 0.0089, 0.0083, 0.0091],'R2': [0.58, 0.68, 0.72, 0.66],'Hit Rate': [53.6, 60.1, 62.4, 58.7],'Training Time (s)': [45, 128, 192, 348]})# Set up the plot stylesns.set_style("whitegrid")colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']# Create figure with two subplots side by sidefig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))# 1. RMSE metrics - separate bars for each modelx = np.arange(len(models))width =0.35ax1.bar(x - width/2, metrics['RMSE'], width, label='RMSE', color='#3498db')ax1.bar(x + width/2, metrics['MAE'], width, label='MAE', color='#e74c3c')ax1.set_title('Error Metrics by Model', fontsize=14, fontweight='bold')ax1.set_xticks(x)ax1.set_xticklabels(models)ax1.set_ylabel('Value (lower is better)', fontsize=12)ax1.legend(title='')# Add value labelsfor i, v inenumerate(metrics['RMSE']): ax1.text(i - width/2, v +0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)for i, v inenumerate(metrics['MAE']): ax1.text(i + width/2, v +0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)# 2. R² and Hit Ratex = np.arange(len(models))width =0.35ax2.bar(x - width/2, metrics['R2'], width, label='R²', color='#2ecc71')ax2.bar(x + width/2, metrics['Hit Rate'] /100, width, label='Hit Rate', color='#9b59b6')ax2.set_title('Accuracy Metrics by Model', fontsize=14, fontweight='bold')ax2.set_xticks(x)ax2.set_xticklabels(models)ax2.set_ylabel('Value (higher is better)', fontsize=12)ax2.legend(title='')# Add value labelsfor i, v inenumerate(metrics['R2']): ax2.text(i - width/2, v +0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=9)for i, v inenumerate(metrics['Hit Rate']): ax2.text(i + width/2, v/100+0.02, f'{v:.1f}%', ha='center', va='bottom', fontsize=9)plt.tight_layout()plt.show()# Also create a simple model comparison table with rankingsranking_table = metrics.set_index('Model')ranking_cols = ['RMSE', 'MAE', 'R2', 'Hit Rate']# Create rankings (1 is best)rankings = pd.DataFrame(index=ranking_table.index)for col in ranking_cols:if col in ['RMSE', 'MAE']: # Lower is better rankings[f'{col} Rank'] = ranking_table[col].rank()else: # Higher is better rankings[f'{col} Rank'] = ranking_table[col].rank(ascending=False)# Calculate average rankrankings['Average Rank'] = rankings.mean(axis=1)rankings = rankings.sort_values('Average Rank')print("Model Rankings (lower is better):")print(rankings)
Figure 6: Performance Comparison of Different Volatility Prediction Models
Model Rankings (lower is better):
RMSE Rank MAE Rank R2 Rank Hit Rate Rank Average Rank
Model
Random Forest 1.0 1.0 1.0 1.0 1.0
XGBoost 2.0 2.0 2.0 2.0 2.0
Neural Network 3.0 3.0 3.0 3.0 3.0
ARIMA 4.0 4.0 4.0 4.0 4.0
These model comparisons provide valuable insights into the relative strengths of different approaches to volatility prediction. Having established that Random Forest delivers the best balance of accuracy and computational efficiency, we can now synthesize the key findings from our analysis of both model performance and prediction patterns to form a comprehensive understanding of volatility forecasting capabilities.
Key Findings
Based on our evaluation results, several important observations emerge:
The model shows regime-dependent performance, tracking volatility well during moderate and high volatility periods (0.20-0.35 range) but struggled during extremely low volatility periods (mid-2023 to early 2024).
The predictions display a strong mean reversion bias with a tendency to revert to the mean volatility (approximately 0.20-0.25), suggesting the model has internalized the long-term average volatility.
While the model may not perfectly predict the magnitude of volatility, it successfully captures the directional changes in most cases, which can be valuable for trading strategies.
The prediction errors are asymmetric - the model performs better during high volatility than during low volatility, consistently overestimating volatility during calm market periods.
Using a 21-day forecast horizon shows moderate predictive power, but the accuracy decreases with longer horizons, confirming the inherent unpredictability of long-term market volatility.
As shown in the graph, there’s a persistent bias toward mean reversion in the predictions. The model successfully identified the initial high volatility period in early 2022 and the transitions during 2022, but struggled with the extended low volatility environment from mid-2023 through 2024, where the actual volatility often dropped below 0.10 while predictions remained above 0.15.
The model also overestimated volatility in the latter part of 2024. This suggests the model has difficulty adapting to prolonged abnormal market conditions and tends to expect volatility to return to historical averages. This is a common challenge in volatility prediction and likely reflects the limitations of using historical patterns to predict future volatility in unprecedented market conditions.
Limitations and Challenges
Despite the promising results, several important limitations of this study should be acknowledged:
Regime-Dependent Performance: The models’ performance varies significantly depending on market conditions. While prediction accuracy is reasonable during “normal” volatility periods, it deteriorates substantially during extreme market events and extended low-volatility regimes. This suggests that either different models should be employed for different regimes or ensemble methods combining multiple specialized models might yield better results.
Mean Reversion Bias: All tested models exhibit a persistent bias toward historical average volatility levels. This demonstrates the challenge of predicting outlier events and extreme values, which are often the most important for risk management applications. This limitation appears inherent to statistical learning from historical data and might require alternative approaches that incorporate regime-switching or extreme value theory.
Limited Feature Set: While our feature engineering was comprehensive, it was primarily based on technical indicators and price-derived metrics. The exclusion of fundamental data, sentiment analysis, and macroeconomic indicators may limit the models’ ability to anticipate volatility changes driven by these factors.
Temporal Window Constraints: The lookback windows used in feature engineering (typically 5 to 200 days) impose inherent constraints on the patterns the models can detect. Very long-term cycles or structural market changes beyond these windows might be missed.
Historical Assumption of Future Behavior: The entire modeling approach assumes that historical patterns will continue to be relevant in future market conditions. Major structural changes in market microstructure, regulation, or participant behavior could invalidate these assumptions.
Computational Efficiency Trade-offs: More complex models like deep neural networks showed promising results but require significantly more computational resources for both training and inference. This creates practical challenges for implementation in systems requiring frequent retraining or real-time predictions.
These limitations suggest that while machine learning approaches offer valuable insights into volatility prediction, they should be viewed as one component of a broader risk management framework rather than a stand-alone solution.
While these calendar effects do influence volatility, my analysis confirmed that market events and macroeconomic factors remain the primary drivers.
Conclusions
My research demonstrates both the potential and limitations of machine learning approaches for volatility prediction. While the models outperformed traditional statistical methods by 12-18% in RMSE, significant challenges remain in accurately forecasting extreme market conditions.
The feature engineering process revealed that technical indicators significantly enhance model performance. The VIX index, recent volatility measures, and price ranges proved particularly valuable predictors. This underscores the importance of incorporating domain knowledge when developing financial prediction models.
My time-series validation methodology prevented look-ahead bias and allowed testing across different market regimes, revealing how performance varies with market conditions. This approach was crucial for ensuring that the results would be applicable in real-world trading scenarios where future data is unavailable.
The models performed well during moderate and high volatility periods but struggled during extremely low volatility periods. I observed a persistent mean reversion bias in the predictions, creating challenges for predicting extreme events. This suggests that while machine learning can capture general market patterns, rare events remain difficult to anticipate.
While the models captured directional changes in volatility fairly well, precisely matching the magnitude of volatility movements proved more difficult, especially during market extremes. This finding has important implications for risk management applications, where accurately gauging the size of potential market moves is often as important as predicting their direction.
Future Work
Looking ahead, I see several promising directions to extend this research. I plan to optimize models through ensemble techniques and stacking to better handle different market regimes, while also incorporating exogenous variables like macroeconomic indicators and sentiment analysis. Another exciting avenue is developing volatility-based trading strategies that leverage these predictions. Finally, testing the approach across multiple asset classes such as individual stocks, commodities, and cryptocurrencies would help assess how generalizable these techniques are across different market structures and behaviors.